Blog

Engineering notes on custom kernels, local inference, and hardware design.

PFlash speculative prefill compression for DFlash

PFlash: 10× prefill speedup over llama.cpp at 128K on an RTX 3090

Long context overwhelms Q4 27B targets on 24 GB GPUs. PFlash compresses 128K → 2.6K with a small drafter before DFlash ever sees the prompt. Head-to-head, cold vs cold: 24.8 s TTFT vs ~257 s for llama.cpp (10.4×), with NIAH retrieval preserved at every measured context length.

Qwen3.5-27B DFlash on ggml

DFlash on ggml: up to 207 tok/s Qwen3.5-27B on an RTX 3090

Standalone C++/ggml speculative decoder for Qwen3.5-27B Q4_K_M with a DFlash block-diffusion draft and a DDtree verifier. 3.43× over autoregressive decoding, 2.8× over SGLang AWQ, 128K context on 24 GB.

RTX 3090 + eGPU dock + MacBook, running NVIDIA on macOS over USB4

The eGPU Myth: Why a ~$300 Dock Won't Turn Your GPU Into an AI Workstation

tinygrad wrote an NVIDIA driver from scratch. We ran real models on an RTX 3090 over USB4. The engineering is brilliant. The numbers aren't there yet. Full benchmarks and profiling.

RTX 3090, the GPU behind the megakernel

Megakernel: Matching Apple Silicon Efficiency at 2× the Throughput on an RTX 3090

The first megakernel for hybrid DeltaNet/Attention LLMs. All 24 layers fused into a single CUDA dispatch. 1.87 tok/J, matching M5 Max efficiency at 1.8× the throughput on a 2020 GPU.